This repository was archived by the owner on Feb 18, 2024. It is now read-only.

Proposal #1

Closed
jorgecarleitao wants to merge 71 commits into main from proposal

Conversation

@jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Feb 7, 2021

See README.md

@jorgecarleitao jorgecarleitao force-pushed the proposal branch 2 times, most recently from 57a129d to d16c779 Compare February 7, 2021 13:20
@elferherrera
Contributor

Wow, I just spent the past hour going through your proposal and it looks really great. I like that you have made the whole memory implementation easier to understand. The code reads very rusty. I also liked the fact that most things are documented and well explained. I think it does make sense to think about implementing this as the base for arrow, especially given the performance improvements you are reporting.

Regarding the NativeType trait and its implementations, I couldn't understand why it has to be unsafe. Do you mind explaining that to me?

@jorgecarleitao
Owner Author

jorgecarleitao commented Feb 11, 2021

I am thinking that while this is a major change to the implementation of Arrow, most of the APIs changed are what I would think of as implementation details.

The main difference happens in interactions with lower-level functionality, yes. But AFAIK folks at UrbanLogic (@maxburke) use them.

For higher-level functionality, the main change is that creating a primitive array from an iterator takes a few more characters:

```rust
let array = iter.map(...).collect::<Primitive<i32>>().to(DataType::Date32);
```

vs

```rust
let array = iter.map(...).collect::<Date32Array>();
```

This derives from the split between the logical and physical parts of the array, so that we can e.g. create timezone-aware timestamps:

```rust
let array = iter.map(...).collect::<Primitive<i64>>().to(DataType::Timestamp(a, b));
```

which is not possible in current arrow without casting the array or using ArrayData directly.

What do you have in mind for next steps?

  1. Evaluate whether this covers all the necessary aspects for it to fly. I am adding some kernels on top of this new format to validate UX and coverage of requirements.

  2. Replace Buffer by Buffer<u8>. This has no semantic change but allows us to prepare the code for typed buffers.

  3. Create all the traits required to read all attributes of all arrays, and implement them for all arrays.

  4. Replace Array::data_ref/data by the new traits. The idea here is that we will continue to create arrays using ArrayData, but we no longer use it to access data, relying instead on type-specific methods. This requires a refactor of the transform and equal modules, as well as the ffi export, IPC writer, and parquet writer (i.e. everything that is type-agnostic and relies on ArrayData).

  5. Change array creation to not rely on ArrayData. After 4, code no longer relies on reads of ArrayData, and thus we can drop it entirely. This requires major changes to the Builder API, as well as to anything that creates arrays, such as the ffi import, IPC reader, parquet reader, etc.

As you can see, this really has a major impact on the crate as a whole, which is why it is so hard to implement.

@jorgecarleitao
Owner Author

Regarding the NativeType trait and its implementations, I couldn't understand why it has to be unsafe. Do you mind explaining that to me?

The arrow specification assumes that a buffer can only be of certain types, and requires specific memory alignments for these. We use that to safely transfer data over FFI boundaries, write to parquet, IPC, etc. Furthermore, our allocator has an optimization in which we assume that NativeType implementations do not contain any pointers.

Thus,

```rust
struct A(HashMap<i32, i32>);

impl NativeType for A {}
```

would leak and would also result in undefined behavior. Thus, we mark the trait as unsafe to tell everyone: "look, do not implement this trait without careful consideration of what you are doing with it: the arrow crate is not prepared to handle arbitrary structs". In the arrow crate it is not marked as such, but it should be.
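To make the contract concrete, here is a minimal sketch of what such an `unsafe` marker trait can look like; the trait name, bounds, and function below are illustrative, not this crate's actual API:

```rust
// `unsafe trait` signals that the *implementer*, not the compiler, guarantees
// the invariants: fixed size, known alignment, and no interior pointers.
pub unsafe trait PlainNative: Copy + Sized + 'static {}

// Plain fixed-size scalars uphold those invariants, so these impls are sound:
unsafe impl PlainNative for i32 {}
unsafe impl PlainNative for f64 {}

// Generic code can now rely on the invariants, e.g. to compute byte widths
// when laying values out in an arrow-style buffer:
pub fn byte_width<T: PlainNative>() -> usize {
    std::mem::size_of::<T>()
}
```

An `unsafe impl PlainNative for A {}` where `A` holds a `HashMap` would still compile, which is exactly why the `unsafe` keyword shifts the responsibility onto whoever writes that impl.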

@elferherrera
Contributor

elferherrera commented Feb 11, 2021

would leak and would also result in undefined behavior. Thus, we mark the trait as unsafe to tell everyone "look, do not
implement this trait without careful consideration about what you are doing with it: the arrow crate is not prepared to handle arbitrary structs". In the arrow crate it is not marked as such but it should be.

Gotcha. So it is a warning to someone wanting to implement the trait for another type. Yesterday I was playing with your code and saw that I could remove the unsafe keyword from the trait and it would still compile.

Out of curiosity, would it not make sense to keep the trait hidden (not pub) instead of using unsafe?

@jorgecarleitao
Owner Author

So it is a warning to someone wanting to implement the trait for another type. Yesterday I was playing with your code and saw that I could remove the unsafe keyword from the trait and it would still compile.

Yes, that is the main use-case of unsafe. Adding or removing unsafe does not affect the produced binary; it is a keyword only used to flag errors. It works a bit like the std::marker traits.

Out of curiosity, would it not make sense to keep the trait hidden (not pub) instead of using unsafe?

I agree with this sentiment. However, that would not allow people to write generics that depend on it: it is not possible in Rust to publicly expose generic functions whose type parameters are not public, as that would "leak" a private trait via a public function. There are some discussions about allowing a trait to be "sealed", which I think would solve this problem, but they are still in the RFC phase.
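The "sealed" workaround that exists today uses a private-module supertrait. A minimal sketch, with illustrative names rather than this crate's actual API:

```rust
mod private {
    // Nominally `pub`, but unreachable from outside this crate because the
    // module itself is private -- so downstream crates cannot implement it.
    pub trait Sealed {}
    impl Sealed for i32 {}
    impl Sealed for i64 {}
}

// Public trait that anyone can use in bounds, but only this crate can
// implement, because it requires the unreachable `Sealed` supertrait.
pub trait NativeType: private::Sealed + Copy {}
impl NativeType for i32 {}
impl NativeType for i64 {}

// Downstream code can still write generics over the sealed trait:
pub fn width<T: NativeType>() -> usize {
    std::mem::size_of::<T>()
}
```

This keeps the trait usable in public signatures while preventing arbitrary impls, which is the combination a plain non-`pub` trait cannot provide.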

@elferherrera
Contributor

I guess sealed traits could be implemented like in this example:
https://rustype.github.io/notes/notes/rust-typestate-series/rust-typestate-part-1.html#stricter-states

@ritchie46
Collaborator

Very interesting read, and it really scratches an itch that I experienced when starting to use Arrow but got used to over time due to exposure. What is the main idea: fork and continue as arrow2, or make an MVP and hope to get that merged into the main project?

@alamb
Collaborator

alamb commented Feb 13, 2021

I hope we don't end up with a fork. While it will be painful, I think breaking this PR up into pieces and bringing it incrementally into the main arrow codebase would be the most plausible way to bring the idea to fruition.

@jorgecarleitao
Owner Author

I also hope so, and I am working towards a plan to have this merged into the main repo.

Technically, the core hypothesis that I am testing atm is that the way this repo handles offsets works in practice, as validated by at least the IPC integration tests. I am really uncertain here.

This PR uses a different approach to offsets: they are no longer tracked only on ArrayData; instead, each Buffer/Bitmap tracks its own, the same way tokio's bytes crate does it. This is dramatically easier to work with, but I am unsure whether it will work with IPC and FFI. If it does not work, I may have to revisit this whole thing. I am working towards having the json IO migrated to this repo, as the integration tests for IPC use json.
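The idea of a buffer that owns its offset can be sketched as follows; this is a simplified illustration in the spirit of the bytes crate, with field and method names that are hypothetical, not this repo's actual types:

```rust
use std::sync::Arc;

// A view into shared, immutable storage. Slicing moves only `offset` and
// `length`; the backing allocation is never copied.
#[derive(Clone)]
pub struct Buffer<T> {
    data: Arc<Vec<T>>, // shared backing storage
    offset: usize,     // where this view starts in `data`
    length: usize,     // number of elements this view covers
}

impl<T> Buffer<T> {
    pub fn new(values: Vec<T>) -> Self {
        let length = values.len();
        Self { data: Arc::new(values), offset: 0, length }
    }

    // Zero-copy slicing: the Arc is cloned, the data is not.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length, "slice out of bounds");
        Self { data: self.data.clone(), offset: self.offset + offset, length }
    }

    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.length]
    }
}
```

Because every buffer carries its own offset, consumers never need a separate ArrayData-level offset to interpret it, which is the simplification being tested against IPC and FFI here.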

@elferherrera
Contributor

Another thing that could help to get traction behind your idea is to have more performance comparisons between the current Arrow and the new implementation. I know you have put a lot of time into this, so if you would like help testing something, let me know and I can help with that. A list of the things that you want to test could be useful so we can work on that.

Sorry to piggyback on this thread to ask something related to your NativeType trait implementation, but would using a private module (example) have the same effect you are looking for when sealing the traits?

@jorgecarleitao
Owner Author

jorgecarleitao commented Feb 28, 2021

Sorry for the late reply, but the last few weeks have been pretty busy. However, I did have time to work on this.

I have now concluded the feasibility study that I wanted to do on this.

  • I was able to make all IPC integration tests pass for both file and stream, for both little endian and big endian (which we do not even support in master atm).
  • I was unable to run parquet read integration tests because we have none: we only test that we can read the files, not that the result is correct.
    • I used pyarrow to get expected values for some of the parquet files we have available, but AFAIK they do not contain nulls, so I was unable to test the validity handling. This approach is also not scalable.
    • I was able to make this fork parse primitive types from parquet files.
    • I have not tested writes, mostly because we have no baseline to compare against.
    • I was unable to correctly read variable-sized arrays from parquet. I am not sure if this is due to my changes or an issue already present on master (AFAIK we do not test that).

I thus conclude that the biggest risk for this endeavor is regressions on the parquet IO.

Recommended actions:

  • Park this until we have sufficient coverage for reading and writing parquet files to perform a migration.
  • Raise the possibility in the mailing list of creating golden parquet files and corresponding .arrow files (e.g. read the parquet from the c++ implementation and write .arrow), so that Rust (and others) can read them and compare against the respective .arrow to verify that the in-memory matches. This will give us IO parity across implementations, which in our case is necessary to validate that the readers and writers are consistent across languages.

@alamb
Collaborator

alamb commented Feb 28, 2021

golden parquet files and corresponding .arrow

I think this is an excellent idea.

@jorgecarleitao
Owner Author

I am closing this PR, as I will start the work of making this repo usable.

The ideas above hold, but I plan to use this repo to host a transmute-free implementation of arrow.

@jorgecarleitao
Owner Author

The repo now contains the design and implementation that I have been baking over the past months. Some modules have a README with the actual design notes of them (i.e. MUST, MAY, SHOULD, etc).

The repo has most things implemented with the notable exception of parquet IO, which I am still trying to grasp.

I feature-gated almost everything so that the crate depends on 3 small dependencies and chrono.

Many, many components were re-written from scratch because they left me no other choice. I also deprecated the "Builder" API, as it is entirely replaced by a FromIter equivalent that supports arbitrary (typed) nesting.
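The FromIterator-style replacement for the Builder API can be sketched roughly as follows; the struct and field names are illustrative (a real implementation would use a packed validity bitmap, not `Vec<bool>`):

```rust
// A simplified primitive array: values plus per-slot validity.
pub struct Primitive<T> {
    values: Vec<T>,
    validity: Vec<bool>, // a packed bitmap in a real implementation
}

// Collecting from an iterator of Option<T> replaces explicit builder calls:
// Some(v) appends a valid value, None appends a null slot.
impl<T: Default> FromIterator<Option<T>> for Primitive<T> {
    fn from_iter<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity = Vec::new();
        for item in iter {
            validity.push(item.is_some());
            values.push(item.unwrap_or_default()); // placeholder value for nulls
        }
        Self { values, validity }
    }
}
```

Usage then reads as ordinary iterator code, e.g. `let a: Primitive<i32> = vec![Some(1), None, Some(3)].into_iter().collect();`, and the same pattern nests for list and struct arrays.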

The specification is always validated when an array is created and there is little room for unsoundness.

@alamb
Collaborator

alamb commented Mar 8, 2021

😮 - so now the next question is, "what next and what can we do to help @jorgecarleitao "?

@jorgecarleitao jorgecarleitao deleted the proposal branch August 22, 2021 21:00